The data provided for the assignment is the policing equity data set, that has information about the incidents and the details of police force that dealt with those incidents in the region of Dallas, Texas. The goal of this study is to dig deep into the data and find out if there are any racial disparities or injustice happening within the police force.
df = read.csv(file = "37-00049_UOF-P_2016_prepped.csv", skip = 1)
Exploring the data to check for the type of data, the shape of the data frame and also checking for any missing values.
#Checking for the count of missing values per column
missing_vals <- sapply(df, function(y) sum(length(which(is.na(y)))))
missing_vals <- as.data.frame(missing_vals)
missing_vals
## missing_vals
## OCCURRED_D 0
## OCCURRED_T 0
## UOFNum 0
## CURRENT_BA 0
## OffSex 0
## OffRace 0
## HIRE_DT 0
## INCIDENT_DATE_LESS_ 0
## OFF_INJURE 0
## OFF_INJURE_DESC 0
## OFF_HOSPIT 0
## CitNum 0
## CitRace 0
## CitSex 0
## CIT_INJURE 0
## SUBJ_INJURE_DESC 0
## CIT_ARREST 0
## CIT_INFL_A 0
## CitChargeT 0
## RA 0
## BEAT 0
## SECTOR 0
## DIVISION 0
## DIST_NAME 0
## STREET_N 0
## STREET 0
## street_g 0
## street_t 0
## Street.Address 0
## City 0
## State 0
## Latitude 55
## Longitude 55
## SERVICE_TY 0
## UOF_REASON 0
## ForceType1 0
## ForceType2 0
## ForceType3 0
## ForceType4 0
## ForceType5 0
## ForceType6 0
## ForceType7 0
## ForceType8 0
## ForceType9 0
## ForceType10 0
## Cycles_Num 0
## ForceEffec 0
There are no misisng values in any of the columns except for the Latitude and Longitude. Both these columns have 55 missing values. Some columns do have NULL values.
str(df)
## 'data.frame': 2383 obs. of 47 variables:
## $ OCCURRED_D : chr "09/03/2016" "3/22/16" "5/22/16" "01/10/2016" ...
## $ OCCURRED_T : chr "4:14:00 AM" "11:00:00 PM" "1:29:00 PM" "8:55:00 PM" ...
## $ UOFNum : chr "37702" "33413" "34567" "31460" ...
## $ CURRENT_BA : int 10810 7706 11014 6692 9844 9855 9881 9058 10381 9705 ...
## $ OffSex : chr "Male" "Male" "Male" "Male" ...
## $ OffRace : chr "Black" "White" "Black" "Black" ...
## $ HIRE_DT : chr "05/07/2014" "01/08/1999" "5/20/15" "7/29/91" ...
## $ INCIDENT_DATE_LESS_: int 2 17 1 24 7 7 7 9 4 8 ...
## $ OFF_INJURE : chr "No" "Yes" "No" "No" ...
## $ OFF_INJURE_DESC : chr "No injuries noted or visible" "Sprain/Strain" "No injuries noted or visible" "No injuries noted or visible" ...
## $ OFF_HOSPIT : chr "No" "Yes" "No" "No" ...
## $ CitNum : int 46424 44324 45126 43150 47307 46549 47555 44172 43723 47889 ...
## $ CitRace : chr "Black" "Hispanic" "Hispanic" "Hispanic" ...
## $ CitSex : chr "Female" "Male" "Male" "Male" ...
## $ CIT_INJURE : chr "Yes" "No" "No" "Yes" ...
## $ SUBJ_INJURE_DESC : chr "Non-Visible Injury/Pain" "No injuries noted or visible" "No injuries noted or visible" "Laceration/Cut" ...
## $ CIT_ARREST : chr "Yes" "Yes" "Yes" "Yes" ...
## $ CIT_INFL_A : chr "Mentally unstable" "Mentally unstable" "Unknown" "FD-Unknown if Armed" ...
## $ CitChargeT : chr "APOWW" "APOWW" "APOWW" "Evading Arrest" ...
## $ RA : int 2062 1197 4153 4523 2167 1134 2049 3122 2072 4403 ...
## $ BEAT : int 134 237 432 641 346 235 132 515 133 614 ...
## $ SECTOR : int 130 230 430 640 340 230 130 510 130 610 ...
## $ DIVISION : chr "CENTRAL" "NORTHEAST" "SOUTHWEST" "NORTH CENTRAL" ...
## $ DIST_NAME : chr "D14" "D9" "D6" "D11" ...
## $ STREET_N : int 211 7647 716 5600 4600 1234 511 4709 300 18600 ...
## $ STREET : chr "Ervay" "Ferguson" "bimebella dr" "LBJ" ...
## $ street_g : chr "N" "NULL" "NULL" "NULL" ...
## $ street_t : chr "St." "Rd." "Ln." "Frwy." ...
## $ Street.Address : chr "211 N ERVAY ST" "7647 FERGUSON RD" "716 BIMEBELLA LN" "5600 L B J FWY" ...
## $ City : chr "Dallas" "Dallas" "Dallas" "Dallas" ...
## $ State : chr "TX" "TX" "TX" "TX" ...
## $ Latitude : num 32.8 32.8 32.7 NA NA ...
## $ Longitude : num -96.8 -96.7 -96.9 NA NA ...
## $ SERVICE_TY : chr "Arrest" "Arrest" "Arrest" "Arrest" ...
## $ UOF_REASON : chr "Arrest" "Arrest" "Arrest" "Arrest" ...
## $ ForceType1 : chr "Hand/Arm/Elbow Strike" "Joint Locks" "Take Down - Group" "K-9 Deployment" ...
## $ ForceType2 : chr "" "" "" "" ...
## $ ForceType3 : chr "" "" "" "" ...
## $ ForceType4 : chr "" "" "" "" ...
## $ ForceType5 : chr "" "" "" "" ...
## $ ForceType6 : chr "" "" "" "" ...
## $ ForceType7 : chr "" "" "" "" ...
## $ ForceType8 : chr "" "" "" "" ...
## $ ForceType9 : chr "" "" "" "" ...
## $ ForceType10 : chr "" "" "" "" ...
## $ Cycles_Num : chr "NULL" "NULL" "NULL" "NULL" ...
## $ ForceEffec : chr " Yes" " Yes" " Yes" " Yes" ...
The columns like OCCURED_D,OCCURED_T are being read as character data types, We need to manually convert them into the type Date.
#Converting necessary columns into date type
df$OCCURRED_D <- mdy(df$OCCURRED_D)
#Extracting the monthas a seperate colum for future use
df$Month_of_INC <- months(as.Date(df$OCCURRED_D),abbreviate = TRUE)
df$MONTH <-format(df$OCCURRED_D,"%m")
#df$Month_of_INC #A new column in added to the df, which has the month information for the specific row
I have created a new column named Month_of_INC in the data frame. This column has the value of Month extracted from the OCCURED_D column. The purpose of doing that is to make it easy to plot maps going forward.
The data is now prepared for the analysis. Lets start by checking how many incidents happen on a monthly basis in the state.
#In order to make it easy to plot, I am grouping certain data together
monthly_incidents <- df %>%
group_by(MONTH) %>%
summarize(count =n())
#monthly_incidents
month_count <- ggplot(data=monthly_incidents, aes(x=MONTH, y=count, group=1)) +
geom_line()+
geom_point()+ scale_color_viridis(discrete = TRUE) +labs(x="Month", y="Count of Incidents") + ggtitle("Crimes per month") #scale_x_date(labels = date_format("%b"))
ggplotly(month_count)
The above graph shows the number of crimes per each month. * The incident rate peaked in the month of March with a total of 264 incidents followed closely by February with 254. * December recorder the lowest number of incidents.
Let’s have a look at the ratio of Male and Female officers in the force.
off_ratio_plot <- ggplot(df,aes(x=factor(OffSex)))+
geom_bar(position="dodge")+theme_minimal()+
labs(title="Gender ratio of officers in the force", x="Officer Gender", y = "Count")+
geom_text(aes(label=..count..),stat='count',position=position_dodge(0.7),vjust=-0.2)
off_ratio_plot
The ratio of female officers in the force is drastically low compared to males. Females account for only about a 10% of the total force.
division_count <- df %>%
group_by(MONTH,DIVISION) %>%
summarize(count = n())
## `summarise()` has grouped output by 'MONTH'. You can override using the
## `.groups` argument.
div <- ggplot(division_count, aes(x = MONTH, y = count, group = 'count')) + geom_area( aes(color= DIVISION, fill=DIVISION), size = 1.2) +
facet_wrap(~ DIVISION,ncol=4) +theme_cleveland() +theme(legend.position="none") +
labs(x="Months ", y= "INCIDENT COUNTS", title=" Incidents per each Division")+theme(axis.text.x = element_text(angle = 90))
ggplotly(div)
The above plots show that the highest number of incidents occurred in the CENTRAL division and the lowest in the NORTHWEST. It’s safe to assume that NORTHWEST is the relatively safe compared to all the divisions. Also the months of February and March have the most number of incidents across all the divisions.
Let’s have a look at the race of the officers.
off_race_plot <- ggplot(df,aes(x=factor(OffRace), fill=OffRace))+
geom_bar(position="dodge")+scale_fill_brewer(palette = "Paired")+
labs(title="Ratio of Officer Race", x="Officer Race", y = "Count")+
geom_text(aes(label=..count..),stat='count',position=position_dodge(0.7),vjust=-0.2)
off_race_plot
The majority of the officers are White, followed by Hispanic. American Ind is the least seen race in the officers with just 8 officers in total. Black officers account for the third highest race. Although this may not tell the whole story, it seems like the police force is mostly filled with White officers.
Let’s try to look if there are any patterns between the race of the race of the officers that dealt with each of the incidents.
oficer_race_vs_force = ggplot(df,aes(x = OffRace,fill = SERVICE_TY)) +
geom_bar(position = "dodge") +
theme(axis.text.x = element_text(size = 8,
angle = 90,
hjust = 1,
vjust = 1))+labs(x="Race of officer", y= "Count of force", title = "Race of the involved officers vs Incident Reason")
ggplotly(oficer_race_vs_force)
White officers are more involved in the incidents than any other race due to their ratio in the force. Here are few interesting observations: * Black officers are more involved in crowd control that white. Although the overall occurrence of crowd control is low, most of it was dealth by the black officers. * White officers are more involved in the incidents occurring due to suspicious activity. * Black officers are involved in the most number of Off_Duty Incidents with a total of 21 incidents.
subject_race_plot <- ggplot(df,aes(x=factor(CitRace), fill=CitRace))+
geom_bar(position="dodge")+scale_fill_brewer(palette = "Set2")+
labs(title="Ratio of Subject Race", x="Subject Race", y = "Count")+
geom_text(aes(label=..count..),stat='count',position=position_dodge(0.7),vjust=-0.2)
subject_race_plot
The above graph shows that the Black people have committed more crimes in the region, followed by the Hispanic and then the White races. While the stats tell us that the black race has been responsible for more incidents, there is no data available to evaluate the authenticity of the incident.
Let’s look at the race and gender of the subjects.
race_to_sex_ratio <- ggplot(data = df, aes(x= CitRace, fill=CitSex)) + geom_bar(na.rm = TRUE)+theme_classic2()+scale_fill_brewer(palette = "Accent")+labs(y = "Count", x = "Suspect Race", title = "Ratio of Suspect Race to Gender")
ggplotly(race_to_sex_ratio)
Black Males and Females have been involved in more incidents than any other gender and race.
Lets try to visualize how many incidents have led to arrests.
arrests <- df %>%
group_by(OCCURRED_D, CIT_ARREST) %>%
summarize(count =n())
## `summarise()` has grouped output by 'OCCURRED_D'. You can override using the
## `.groups` argument.
ggplot(df,aes(x=CIT_ARREST, fill=CIT_ARREST))+
geom_bar(position="dodge")+scale_fill_brewer(palette = "Paired")+
labs(title="Incident vs Arrest Ratio", x="Suspect Arrested", y = "")+
geom_text(aes(label=..count..),stat='count',position=position_dodge(0.9),vjust=-0.2)
Out of a total of 2,383 reported incidents, 86% of the time, the subject was arrested.
Let’s take a look at the arrest ratio by race of the suspect
arrest_race_plot <- ggplot(df,aes(x=CIT_ARREST, fill=CitRace))+
geom_bar(position="dodge")+scale_fill_brewer(palette = "Set2")+
labs(title="Arrest Ratio by Race", x="Suspect Arrested", y = "Count")+
geom_text(aes(label=..count..),stat='count',position=position_dodge(0.9),vjust=-0.2)
arrest_race_plot
Black Males are the most arrested subjects followed by Male Hispanics.
Let’s plot the subject description to see what state the subject was in when they committed the incident.
suspect_description_plot <- ggplot(data = df, aes(x= CIT_INFL_A, fill=CIT_ARREST)) + geom_bar(alpha = 0.5)+theme_classic()+labs(y = "Count", x = "Subject Description", title = "Subject Description vs Arrest")+theme(axis.text.x = element_text(angle = 90), legend.position = "bottom" )
ggplotly( suspect_description_plot)
sus_des <- ggplot(data = df, aes(x= CIT_INFL_A, fill=CitRace)) + geom_bar(position = "fill")+theme_minimal()+labs(y = "", x = "", title = "Suspect Description Count")+theme(axis.text.x = element_text(angle = 90), legend.position = "bottom")
ggplotly(sus_des)
#suspect injurey by race
injured_race_plot <- ggplot(df,aes(x=CIT_INJURE, fill=CitRace))+
geom_bar(position="dodge")+scale_fill_brewer(palette = "Set1")+
labs(title="Injury Per Race", x="Subject Injured", y = "Count")+
geom_text(aes(label=..count..),stat='count',position=position_dodge(0.9),vjust=-0.2)
injured_race_plot
The injury rate seems to be relatively low compared to the total population.
Let’s plot the most common incidents reported in the region.
offences_types <- df %>%
group_by(CitChargeT) %>%
summarise(counts = n())
top_n(offences_types, n=10) %>%
ggplot(., aes(x=CitChargeT, y=counts))+
geom_bar(stat='identity')+theme(axis.text.x = element_text(angle = 90), legend.position = "None")
## Selecting by counts
Apoww is the highest incident reported in the region, followed by no arrest and public intoxication.
inc_vs_race = ggplot(df,aes(x = CitRace,fill = SERVICE_TY)) +
geom_bar(position = "dodge") +
theme(axis.text.x = element_text(size = 8,
angle = 90,
hjust = 1,
vjust = 1))
ggplotly(inc_vs_race)
Most of the incidents occurred as a result of Arrest followed by service call in blacks, whites and Hispanics.
race_vs_div = ggplot(df,aes(x = DIVISION,fill = CitRace)) +
geom_bar(position = "dodge") +
theme(axis.text.x = element_text(size = 8,
angle = 90,
hjust = 1,
vjust = 1))+labs(y = "", x = "Division", title = "Distribution of Race across Divisions")
race_vs_div
The leading population of incidents are black across all the regions, with the highest amount of blacks in the central region.
arrests_vs_race <- ggplot(df,aes(x=UOF_REASON, y=after_stat(count), fill=CitRace))+ geom_bar()+theme(axis.text.x = element_text(angle = 90), legend.position = "bottom")+labs(title="Different forces used on each race")
ggplotly(arrests_vs_race)
Let’s have a look at the race of subjects, tha are involved in the top most frequent incidents
race_top_inc <- ggplot(subset(df,CitChargeT %in% c("APOWW","Assault","Assault/FV","No Arrest","Public Intoxication","Warrant/Hold")) ,aes(x=CitChargeT, y=after_stat(count), fill=CitRace))+ geom_bar(position = "dodge")+labs(y = "", x = "Incident", title = "Races involved in top incidents")
ggplotly(race_top_inc)
top_inc_force <- ggplot(subset(df,CitChargeT %in% c("APOWW","Assault","Assault/FV","No Arrest","Public Intoxication","Warrant/Hold")) ,aes(x=CitChargeT, y=after_stat(count), fill=ForceType1))+ geom_bar(position = "dodge") + labs(x="Offence Type", y = "Count", title="Type of force used in each offence")+theme(axis.text.x = element_text(angle = 90))
ggplotly(top_inc_force)
Verbal command is the majority force used by the officers. In 97 cases of No Arrest a weapon display has occurred.
race_vs_force <- ggplot(subset(df,CitRace %in% c("Black","White","Hispanic")),aes(x=CitRace, y=after_stat(count), fill=ForceType1),color="black")+ geom_bar(position = "dodge", size=2)+theme_minimal()+labs(x="Race of subject", y="Count", title="Force used on different races")
ggplotly(race_vs_force)
Black suspects hare faced the highest number of verbal commands, tasers and weapon displays.
street_types <- ggplot(df,aes(x=street_t, y=after_stat(count), fill=street_t))+ geom_bar()+labs(x="Street Type", y="Count", title="Incidents across various street types")+theme(axis.text.x = element_text(angle = 90))
ggplotly(street_types)
Most number of incidents have occurred at “St.” followed by Driveways.
inc_type_per_div <- ggplot(subset(df,CitChargeT %in% c("APOWW","Assault","Assault/FV","No Arrest","Public Intoxication","Warrant/Hold")) ,aes(x=CitChargeT, y=after_stat(count), fill=DIVISION))+ geom_bar(position = "dodge")+ labs(x="Offence type", y="Count", title="Offence types across divisions")
ggplotly(inc_type_per_div)
Public intoxication is the highest in the central region, assault is highest in the southwest region
Plotting the street names that had the most incidents
top_street <- df %>%
group_by(STREET) %>%
summarise(counts = n())
top_n(top_street, n=10) %>%
ggplot(., aes(x=STREET, y=counts))+
geom_bar(stat='identity')+theme_bw()+labs(x="Street Name", y="Number of Incidents", title="Incident counts in each street")
## Selecting by counts
##
Commerce street has had the most number of incidents.
Let’s try to analyze the behaviors of officers that are new in the force vs seniors
off_experience <- ggplot(data=df,aes(x=INCIDENT_DATE_LESS_ , y= after_stat(count), fill=SERVICE_TY))+geom_bar()+labs(x="Years in force", y="Count", title = "Officer experience vs Incident Reason")
ggplotly(off_experience)
Young officers that have been in the force for less than 10 years are the most involved in incidents that resulted in arrests. They are also most likely to call for cover.
##Plotting incidents per regions and race
df2 <- df[!is.na(df$Latitude) & !is.na(df$Longitude),]
#head(df2)
area_vs_race <- mapview(df2, xcol = "Longitude", ycol = "Latitude", zcol="CitRace", crs = 4269, grid = FALSE, legend=TRUE, col.regions = palette("ggplot2"))
area_vs_race
The above map shows us all the incidents that have occurred in the region color coded by the race of the suspects.
The key observations from the analysis are:
The overall crime rate is on a declining pattern. In the month of Dec, the least number of incidents happened.
Central Division recorded the most number of incidents.
March of 2016 recorded the highest number of crimes.
Most of the incidents have been committed by the people belonging to the Black race, followed by Hispanic.
There is no conclusive evidence of racial discrimination by the police towards any particular race. Black race has generally had the highest number of incidents. There is no data available for the legitimacy of the arrests or other charges faced by the suspects.
Due to the high proportion of white officers in the force, most of the incidents were dealt by them. This means, most of the incidents where a black suspect was involved were dealt by white officers.
There is no conclusive proof of any racial discrimination in the state of Texas, according to the data of 2016. Although more data would have given us more insights.